Homework 1
Required libraries
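The original package list isn't shown here; based on the functions used below, something like the following is assumed (a sketch — adjust to whatever your setup actually loads):

```r
library(tidyverse)   # read_csv, dplyr verbs, pivot_wider, str_detect, ggplot2
library(tidycensus)  # get_acs (requires a Census API key; see census_api_key())
```

The modelsummary, ggrepel, performance, and marginaleffects packages are loaded where they are first used below.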
Use the FiveThirtyEight presidential elections data to answer the following questions about the 2020 general election results.
url <- "https://raw.githubusercontent.com/fivethirtyeight/election-results/main/election_results_presidential.csv"
presidential_elections <- read_csv(url)
voteshares <- presidential_elections |>
  filter(office_name == "U.S. President" & cycle == 2020 & stage == "general") |>
  filter(candidate_name %in% c("Joe Biden", "Donald Trump")) |>
  # Keep statewide results only; drop congressional-district rows like "Maine CD-1"
  filter(!str_detect(state, " CD-\\d+")) |>
  group_by(state, candidate_name) |>
  summarise(votes = sum(votes, na.rm = TRUE)) |>
  pivot_wider(names_from = candidate_name, values_from = votes) |>
  mutate(`Biden Share` = `Joe Biden` / (`Joe Biden` + `Donald Trump`))
Question 1
Use the following code to download demographics for each state from the American Community Survey and then use a join to add this column to your data.
Answer
acs_data <- get_acs(geography = "state",
                    variables = c(median_income = "B19013_001",
                                  bachelors_degree = "B15003_022",
                                  pop = "B01001_001",
                                  white_pop = "B02001_002"),
                    year = 2020)
voteshares <- acs_data |>
  select(-moe) |>
  pivot_wider(names_from = variable, values_from = estimate) |>
  right_join(voteshares, by = join_by(NAME == state))
(Note: you can ignore the "moe" column and just use the estimates. You'll need to use pivot_wider to put these into wide format with one row per state.)
Question 2
Run a linear regression to calculate the effect of median income on Biden’s statewide two party vote share. Produce a formatted table to display your results and briefly discuss your findings.
Answer
The basic code here is straightforward, but there are some things you can do to improve the formatting:
- One problem that comes up a lot is that coefficients on variables like median income can be so tiny that they show up as 0.00 in your model output, even though they have a substantial effect on your dependent variable. Where it makes sense to do so, you can measure values like this in larger increments, like $1,000 or even $10,000. This rescaling won't change your substantive results. You can do the same thing for the dependent variable to convert a proportion to a percentage.
- You should generally rename covariates like `median_income` to something more descriptive. R places lots of limitations on how you can name variables, but those limitations don't restrict what you can show in your model output.
- If applicable, add a note to indicate that your output shows standard errors in parentheses.
- It often makes sense to round your statistics to a consistent number of decimal places.
- In some cases you might also consider mean-centering the independent variables. This gives you an interpretable intercept term.
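The mean-centering point can be sketched like this (illustrative only — the `income_centered` column is hypothetical and not used in the models below):

```r
# Mean-center the IV so the intercept becomes the predicted outcome
# for a state at the average median income
voteshares <- voteshares |>
  mutate(income_centered = median_income - mean(median_income, na.rm = TRUE))
```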
# Q2
library(modelsummary)
voteshares <- voteshares |>
  mutate(income10k = median_income / 10000,
         biden_percent = `Biden Share` * 100)
model1 <- lm(biden_percent ~ income10k, data = voteshares)
modelsummary(list("DV: % Biden vote in 2020 election" = model1),
             coef_rename = c(income10k = "Median income ($10,000)"),
             title = "% statewide Biden vote in 2020 by median income",
             fmt = fmt_statistic(estimate = 2, std.error = 2), # round estimates and SEs
             notes = "Std. err in parentheses")
|   | DV: % Biden vote in 2020 election |
|---|---|
| Std. err in parentheses | |
| (Intercept) | -2.20 |
| (7.33) | |
| Median income ($10,000) | 7.98 |
| (1.11) | |
| Num.Obs. | 51 |
| R2 | 0.513 |
| R2 Adj. | 0.503 |
| AIC | 369.1 |
| BIC | 374.9 |
| Log.Lik. | -181.562 |
| F | 51.599 |
| RMSE | 8.51 |
One way to enhance your analysis is by plotting your regression predictions. This is especially helpful in cases like this one where you have a relatively small number of observations:
library(ggrepel)
dc <- data.frame(state = "District of Columbia", abb = "DC")
states_table <- data.frame("state" = state.name, "abb" = state.abb) |>
  bind_rows(dc)
voteshares |>
  left_join(states_table, by = join_by(NAME == state)) |>
  ggplot(
    aes(x = income10k,
        y = biden_percent,
        label = abb)
  ) +
  geom_hline(yintercept = 50, color = 'lightgrey', lty = 2) +
  geom_point(aes(color = biden_percent > 50)) +
  geom_smooth(method = 'lm', color = 'darkgrey', lty = 2, se = FALSE) +
  geom_text_repel(aes(color = biden_percent > 50)) +
  theme_bw() +
  xlab("Median income ($10,000)") +
  ylab("Biden % of two-party vote") +
  scale_color_manual(guide = 'none', values = c('#DC143C', 'blue')) +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
Another way to enhance your analysis is to talk about how your model's predicted results might change under different scenarios. For instance, it would take roughly a $1,600 increase in Texas's median income (1.6 × 0.798 points per $1,000 ≈ the 1.27 points Texas falls short of 50%) to move it from a predicted loss to a predicted win:
texas_pred <- predict(model1, newdata = voteshares |> filter(NAME == "Texas"))
needed <- 50 - texas_pred       # percentage points short of 50%
income_coef <- 7.98             # points per $10,000 of median income
10000 * (needed / income_coef)  # dollars of median income needed
      1
1596.53
Question 3
Run the previous model, but include Bachelor’s degrees and population. Put your results in a formatted table and assess your results.
Answer
Again, the code here is relatively straightforward, and the advice above all applies here.
When you have multiple models, it's often helpful to include them side-by-side so readers can compare results:
voteshares <- voteshares |>
  mutate(bachelors_pct = (bachelors_degree / pop) * 100,
         population100k = pop / 100000)
model2 <- lm(biden_percent ~ income10k + bachelors_pct + population100k, data = voteshares)
mlist <- list("Model 1" = model1, "Model 2" = model2)
modelsummary(mlist,
             coef_rename = c(income10k = "Median income ($10,000)",
                             bachelors_pct = "% Bachelor's degree or higher",
                             population100k = "Population (100,000)"),
             title = "% statewide Biden vote in 2020 by median income",
             fmt = fmt_statistic(estimate = 2, std.error = 2),
             notes = "Std. err in parentheses")
|   | Model 1 | Model 2 |
|---|---|---|
| Std. err in parentheses | ||
| (Intercept) | -2.20 | -10.11 |
| (7.33) | (7.64) | |
| Median income ($10,000) | 7.98 | 4.44 |
| (1.11) | (1.76) | |
| % Bachelor's degree or higher | 2.18 | |
| (0.90) | ||
| Population (100,000) | 0.02 | |
| (0.02) | ||
| Num.Obs. | 51 | 51 |
| R2 | 0.513 | 0.577 |
| R2 Adj. | 0.503 | 0.550 |
| AIC | 369.1 | 365.9 |
| BIC | 374.9 | 375.5 |
| Log.Lik. | -181.562 | -177.939 |
| F | 51.599 | 21.407 |
| RMSE | 8.51 | 7.93 |
Question 4
Check your model for heteroskedasticity, non-linearity, or influential outliers. Discuss your interpretation.
Answer
(You may need to adjust the figure width and height to get readable plots here)
# Q4
library(performance)
check_model(model2)
There aren't any major red flags here. There's no significant heteroskedasticity, the residuals are mostly normal, and there are no signs of significant collinearity or outliers.
check_heteroscedasticity(model2)
OK: Error variance appears to be homoscedastic (p = 0.225).
check_normality(model2)
OK: residuals appear as normally distributed (p = 0.099).
check_outliers(model2)
OK: No outliers detected.
- Based on the following method and threshold: cook (0.9).
- For variable: (Whole model)
check_collinearity(model2)
While it's not strictly necessary here, we can easily get heteroskedasticity-robust standard errors by using the vcov argument in modelsummary():
modelsummary(mlist,
             coef_rename = c(income10k = "Median income ($10,000)",
                             bachelors_pct = "% Bachelor's degree or higher",
                             population100k = "Population (100,000)"),
             title = "% statewide Biden vote in 2020 by median income",
             fmt = fmt_statistic(estimate = 2, std.error = 2),
             notes = "Robust standard errors in parentheses",
             vcov = "HC3" # robust to heteroskedasticity, small-sample bias, and leverage
)
|   | Model 1 | Model 2 |
|---|---|---|
| Robust standard errors in parentheses | ||
| (Intercept) | -2.20 | -10.11 |
| (8.89) | (10.38) | |
| Median income ($10,000) | 7.98 | 4.44 |
| (1.42) | (2.14) | |
| % Bachelor's degree or higher | 2.18 | |
| (1.06) | ||
| Population (100,000) | 0.02 | |
| (0.01) | ||
| Num.Obs. | 51 | 51 |
| R2 | 0.513 | 0.577 |
| R2 Adj. | 0.503 | 0.550 |
| AIC | 369.1 | 365.9 |
| BIC | 374.9 | 375.5 |
| Log.Lik. | -181.562 | -177.939 |
| F | 31.531 | 16.139 |
| RMSE | 8.51 | 7.93 |
| Std.Errors | HC3 | HC3 |
Or we can get bootstrapped standard errors:
set.seed(1000)
modelsummary(mlist,
             coef_rename = c(income10k = "Median income ($10,000)",
                             bachelors_pct = "% Bachelor's degree or higher",
                             population100k = "Population (100,000)"),
             title = "% statewide Biden vote in 2020 by median income",
             fmt = fmt_statistic(estimate = 2, std.error = 2),
             notes = "Bootstrapped standard errors in parentheses",
             vcov = "bootstrap")
|   | Model 1 | Model 2 |
|---|---|---|
| Bootstrapped standard errors in parentheses | ||
| (Intercept) | -2.20 | -10.11 |
| (8.02) | (10.12) | |
| Median income ($10,000) | 7.98 | 4.44 |
| (1.25) | (1.77) | |
| % Bachelor's degree or higher | 2.18 | |
| (0.94) | ||
| Population (100,000) | 0.02 | |
| (0.02) | ||
| Num.Obs. | 51 | 51 |
| R2 | 0.513 | 0.577 |
| R2 Adj. | 0.503 | 0.550 |
| AIC | 369.1 | 365.9 |
| BIC | 374.9 | 375.5 |
| Log.Lik. | -181.562 | -177.939 |
| RMSE | 8.51 | 7.93 |
| Std.Errors | Bootstrap | Bootstrap |
Extra code
Logging an IV
It’s probably not necessary in this case, but sometimes we might want to change variables like population to use a log scale. This reduces the influence of outliers and can reflect a potential non-linear relationship between an IV and a DV. Note that the log scaling will change our interpretation of the results. When an independent variable has been log-transformed, you can interpret the coefficient as "a 1% increase in [the IV] is associated with a [coefficient]/100 unit increase in the DV". So: in model 3, a 1% increase in population is associated with roughly a 0.01 percentage-point (1.23/100) increase in Biden's predicted vote share in a state.
model3 <- lm(biden_percent ~ income10k + bachelors_pct + log(population100k), data = voteshares)
mlist <- list("Model 1" = model1, "Model 2" = model2, "Model 3" = model3)
modelsummary(mlist,
             coef_rename = c(income10k = "Median income ($10,000)",
                             bachelors_pct = "% Bachelor's degree or higher",
                             population100k = "Population (100,000)",
                             `log(population100k)` = "Log population (100,000)"),
             title = "% statewide Biden vote in 2020 by median income",
             fmt = fmt_statistic(estimate = 2, std.error = 2),
             notes = "Std. err in parentheses")
|   | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| Std. err in parentheses | |||
| (Intercept) | -2.20 | -10.11 | -14.25 |
| (7.33) | (7.64) | (8.74) | |
| Median income ($10,000) | 7.98 | 4.44 | 4.63 |
| (1.11) | (1.76) | (1.76) | |
| % Bachelor's degree or higher | 2.18 | 2.15 | |
| (0.90) | (0.90) | ||
| Population (100,000) | 0.02 | ||
| (0.02) | |||
| Log population (100,000) | 1.23 | ||
| (1.12) | |||
| Num.Obs. | 51 | 51 | 51 |
| R2 | 0.513 | 0.577 | 0.575 |
| R2 Adj. | 0.503 | 0.550 | 0.548 |
| AIC | 369.1 | 365.9 | 366.2 |
| BIC | 374.9 | 375.5 | 375.8 |
| Log.Lik. | -181.562 | -177.939 | -178.077 |
| F | 51.599 | 21.407 | 21.208 |
| RMSE | 8.51 | 7.93 | 7.95 |
Another way to interpret the effect of a logged IV is through examples and predictions. For instance, we might look at predicted outcomes for all states at different values of population:
library(marginaleffects)
nd <- datagrid(newdata = voteshares,
               grid_type = 'counterfactual',
               population100k = seq(5, 400, by = 1))
plot_predictions(model3,
                 points = .5, # show actual data points
                 newdata = nd,
                 by = 'population100k') +
  theme_bw() +
  labs(x = "Population 100K", y = '% Biden vote')
We can also look at the average effect on the predicted outcome of increasing population by one unit (100,000 people):
avg_comparisons(model3,variables='population100k')
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
0.0507 0.0462 1.1 0.272 1.9 -0.0398 0.141
Term: population100k
Type: response
Comparison: +1
Or we can look at the effect of moving population from its minimum to its maximum:
avg_comparisons(model3,variables=list('population100k' = 'minmax'))
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
5.2 4.73 1.1 0.272 1.9 -4.08 14.5
Term: population100k
Type: response
Comparison: Max - Min